
Conversation

@vskogstad (Contributor)

Made some changes. The run finishes barely within 90 minutes, so it could go over on a different H100. 5 of the 90 minutes are spent validating against the entire validation set.

Improvements
-Changed the data loader from random sampling to randomized strided sampling without replacement. More unique training samples (no repetitions) gave a higher training loss but brought the validation loss down to 3.126.
-Implemented gated attention (sort of like in Qwen-Next, but I use a full gate instead of per-head gates and SiLU instead of sigmoid; see the sketch after this list). Changed to just 6 attention heads, which is a bit worse but gives higher MFU. In sum this brought validation loss down to 3.1035. https://arxiv.org/pdf/2505.06708
-QK-norm -> 3.094
-Adjusting the AdamW betas from [0.90, 0.95] to [0.90, 0.999] gave a surprisingly large boost -> 3.0839
-U-Net-style architecture with learnable parameters on the skip connections -> 3.0762
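
For reference, a rough sketch of the attention block after the gated-attention and QK-norm changes (names and dimensions are illustrative, and it assumes a recent PyTorch with nn.RMSNorm; not my exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Self-attention with QK-norm and a full (not per-head) SiLU output gate.

    Sketch of the changes described above; sizes and naming are illustrative.
    """
    def __init__(self, d_model: int = 1024, n_heads: int = 6):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)  # one gate over the full output
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.head_dim)  # QK-norm, applied per head
        self.k_norm = nn.RMSNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)              # QK-norm
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        y = y * F.silu(self.gate(x))                       # SiLU gate on the whole vector
        return self.proj(y)
```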

Other attempts
-Lowering the LR for the LM-head layer, as recommended in Dion (the NanoGPT speedrun seems to do it the other way around?). Gave very minimal improvement; a parameter-group sketch follows after this list. https://arxiv.org/pdf/2504.05295
-Document masking. Lower MFU and very slight performance decrease. I really expected this to work a lot better.
-Sliding-window attention. A hybrid with 3 sliding-window layers per 1 full-attention layer is only slightly worse, but gives no speedup.
-Grouped query attention. Worse, as expected.
-Scaling the output of each block as in ERNIE. Worse; I think this might be interfering with my layer-norm scaling.
-Mixing extra embedding values into later value matrices as in the NanoGPT speedrun (looks good two-thirds of the way in but ends up worse at the end). General idea: https://arxiv.org/pdf/2410.17897
-Decreasing/increasing warmup steps or increasing learning rate after adding QK-norm gave no benefit.
-QK-clip. I was not able to get this to work. In theory it should help a bit with the MFU compared to QK-norm.
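
The LM-head LR experiment just needs a separate optimizer parameter group; a minimal sketch (the model attributes and the 0.5x factor are illustrative, not the values from my runs):

```python
import torch

def build_optimizer(model: torch.nn.Module, base_lr: float = 3e-3):
    # Hypothetical model with a .lm_head attribute; the 0.5x head LR is illustrative.
    head_params = list(model.lm_head.parameters())
    head_ids = {id(p) for p in head_params}
    body_params = [p for p in model.parameters() if id(p) not in head_ids]
    return torch.optim.AdamW(
        [
            {"params": body_params, "lr": base_lr},
            {"params": head_params, "lr": 0.5 * base_lr},  # lower LR for the LM head
        ],
        betas=(0.9, 0.999),  # the betas that worked best above
        weight_decay=0.0,
    )
```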

@vskogstad (Contributor Author)

Made some more adjustments:
-Scaled down d_model from 1536 to 1024 and increased steps -> 3.059
-Changed from Muon to NorMuon -> 3.0396.
-Finally got extra value embeddings mixed into the V-matrix to work. Learnable scalar mixing of the embeddings in only the final two layers gave identical results to mixing in all layers, so I went with that implementation (sketched at the end of this comment). -> 3.032
-Increased the number of training steps to the edge of the time limit -> 3.0305.
I now measure validation loss on a small subset of the validation set during training and compute the loss on the entire validation set after training, which is not included in the runtime. The run shown in the graph actually reached a validation loss of 3.0285, but to leave some margin for worse GPU performance I decreased the number of steps by 200 for my submission, which gives a validation loss of 3.0305.
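
A rough sketch of the value-embedding mixing (names, init, and wiring are illustrative; the module is only applied in the final two layers):

```python
import torch
import torch.nn as nn

class ValueEmbedMix(nn.Module):
    """Mix a separate token value-embedding into the attention V projection.

    v <- (1 - lam) * v + lam * value_embed(tokens), with lam a learnable scalar.
    Sketch of the idea above; only enabled in the final two layers.
    """
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.value_embed = nn.Embedding(vocab_size, d_model)
        self.lam = nn.Parameter(torch.tensor(0.5))  # learnable mixing scalar (init illustrative)

    def forward(self, v: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # v: (B, T, d_model) value projection before splitting into heads; tokens: (B, T) ids
        ve = self.value_embed(tokens)
        return (1.0 - self.lam) * v + self.lam * ve
```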

@vskogstad (Contributor Author)

Using this thread as a way to log what I've tried. Mostly failed at reproducing promising papers:

  • Cautious weight decay: only apply weight decay where the update and parameter signs align (looks super good early, but loses out to the baseline when the LR decay kicks in, see plot; a sketch follows after the plot).
  • SeedNorm: Basically identical to RMSNorm, but higher MFU. Maybe I am already getting the benefits by scaling the outputs by layer depth?
  • SkyLadder: tried both a linear and a stepwise increase. Both give a worse final loss.
  • DeepSeek init: changed the linear-layer initialization std to the same as in DeepSeek-V3. Again, looks good early and seems to train fine, but ends up being worse in the end.
  • Scalable Softmax: very slightly worse.
[plot: comparison of the attempts above against the baseline]
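
A minimal sketch of the cautious weight decay rule (decay only the coordinates where the update and the parameter agree in sign); the decoupled-decay form below is an assumption, not the exact implementation I tested:

```python
import torch

@torch.no_grad()
def cautious_weight_decay_step_(param: torch.Tensor, update: torch.Tensor,
                                lr: float, wd: float) -> None:
    """Apply the optimizer update, then decay only where update and param signs align.

    `update` is the raw optimizer step (subtracted from the parameter).
    """
    # Decide where to decay before the parameter moves.
    mask = (torch.sign(update) == torch.sign(param)).to(param.dtype)
    param.add_(update, alpha=-lr)        # usual parameter update
    param.mul_(1.0 - lr * wd * mask)     # masked, decoupled weight decay
```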

I've also been experimenting a bit with other patterns for the skip connections and with tuning the initializations of the trainable scalars. Right now the first skip connection, at the middle of the network, basically acts as a multiplier, but the network corrects it into a dampener, effectively cutting the model in half. The scalars for my best run are:
[-0.9562, 0.0534, 0.0677, 0.1940, 0.3563, 0.6564]
I've tried removing the middle layer, removing the skip connection, and using sigmoid gating to force positive contributions from the skip connections. All give a worse final loss, so I guess the model gets some benefit from it? A rough sketch of the skip-connection scheme is below.
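
The sketch (block count, skip pairing, and zero init of the scalars are illustrative, not my exact configuration):

```python
import torch
import torch.nn as nn

class UNetSkipStack(nn.Module):
    """Transformer block stack with U-Net-style skips weighted by learnable scalars.

    The first half of the blocks push their outputs onto a stack; the second half
    pop them (LIFO, so the middle of the net skips to the first decoder block)
    and add them back, each scaled by its own learnable scalar.
    """
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        assert len(blocks) % 2 == 0
        self.blocks = blocks
        self.skip_scales = nn.Parameter(torch.zeros(len(blocks) // 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = len(self.blocks) // 2
        saved = []
        for block in self.blocks[:half]:
            x = block(x)
            saved.append(x)                                # encoder half: remember activations
        for i, block in enumerate(self.blocks[half:]):
            x = x + self.skip_scales[i] * saved.pop()      # decoder half: scaled skip
            x = block(x)
        return x
```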

I tried incorporating backout from all layers to find which ones would be beneficial, but basically ended up with a garbled mess of positive and negative contributions, with no clear trend as in the NanoGPT speedrun. Overall, I'm not sure whether skip connections would scale or just yield a better short-term loss.

Successes:
Based on CWD (cautious weight decay) looking super good for so long, I attempted to modify the LR into a range where CWD would beat the baseline. Accidentally found a slightly better minimum LR.
